Predicting Concrete Compression Strength using Regression Models

Comparison of Linear, Polynomial and Non-parametric Regression Models on Performance

Data Description:

The actual concrete compressive strength (MPa) for a given mixture at a specific age (days) was determined in the laboratory. The data is in raw (unscaled) form, with 8 quantitative input variables, 1 quantitative output variable, and 1030 instances (observations).

Domain: Cement manufacturing

Context:

Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.

Attribute Information:

- cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate -- quantities per m3 of mixture (kg/m3)
- age -- age of the concrete (days)
- strength -- concrete compressive strength (MPa), the target variable

Learning Outcomes:

Step 1: Import the necessary Libraries

Step 2: Load the dataset

Checking the shape of the dataset

Checking the datatypes and null records

Checking total number of Null values in the dataset

Checking if the dataset has only Numeric Data
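The loading and sanity checks above might look like the sketch below. The file name `concrete.csv` is a placeholder for the actual data file, and the two inline sample rows (taken from the UCI Concrete Compressive Strength data) stand in for the full 1030-row file; the short column names mirror the ones used elsewhere in this notebook (e.g. `coarseagg`, `fineagg`).

```python
import io
import pandas as pd

# In the real notebook: df = pd.read_csv("concrete.csv")  # hypothetical path
# For illustration, a tiny in-memory sample with the same column names:
csv = io.StringIO(
    "cement,slag,ash,water,superplastic,coarseagg,fineagg,age,strength\n"
    "540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99\n"
    "332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27\n"
)
df = pd.read_csv(csv)

print(df.shape)                 # (rows, columns) -- full data is (1030, 9)
print(df.dtypes)                # all columns should be numeric
print(df.isnull().sum().sum())  # total null count -- expect 0
```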

Observations:

Step 3: Statistical Summary (Five Number Summary) of the Dataset

Observations:

Step 4: Exploratory Data Analysis (Univariate & Multivariate)

Univariate Analysis

Observations:

Shapiro Test for Normality of Features

None of the variables pass the Shapiro-Wilk test for normality at the 0.01 significance level.
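The test itself can be sketched as follows, here on a synthetic right-skewed sample standing in for one of the features:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
# Stand-in for a skewed feature: exponential, clearly non-normal
sample = rng.exponential(scale=1.0, size=500)

# Null hypothesis: the sample is drawn from a normal distribution.
# p < 0.01 -> reject normality at the 1% significance level.
stat, p = shapiro(sample)
print(f"W={stat:.4f}, p={p:.2e}, looks normal: {p >= 0.01}")
```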

Skewness Check

Outliers Detection & Treatment
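A common IQR-based treatment, sketched below on synthetic data, is to cap (winsorise) values beyond 1.5 x IQR from the quartiles rather than drop rows, so no observations are lost:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# 200 well-behaved values plus two planted outliers
s = pd.Series(np.concatenate([rng.normal(50, 5, 200), [120.0, -30.0]]))

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap (winsorise) rather than drop, so the row count is unchanged
capped = s.clip(lower, upper)
print(f"before: min={s.min():.1f} max={s.max():.1f}")
print(f"after:  min={capped.min():.1f} max={capped.max():.1f}")
```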

Observations:

High Leverage / Influence Points Detection using Cook's Distance

Observations:

Both of these indicate that there is multicollinearity among the predictor variables.

Setting up a leverage cutoff to identify high-leverage points. A common rule of thumb is 3(k+1)/n, where k is the number of predictors and n the number of observations.

Since all of these points have low studentised residuals, there is no need to remove them from the dataset.
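The leverage computation behind this check can be sketched directly in NumPy: leverage values are the diagonal of the hat matrix H = X(X'X)^-1 X'. The data below is synthetic, with one deliberately extreme row planted to trip the cutoff:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 3                      # n observations, k predictors
X = rng.normal(size=(n, k))
X[0] = [8.0, 8.0, 8.0]             # one deliberately extreme row

# Add an intercept column; leverage_i = H_ii where H = X (X'X)^-1 X'
Xd = np.column_stack([np.ones(n), X])
H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T
leverage = np.diag(H)

cutoff = 3 * (k + 1) / n           # common 3(k+1)/n rule of thumb
high = np.where(leverage > cutoff)[0]
print(f"cutoff={cutoff:.3f}, high-leverage rows: {high}")
```

Note that the leverages always sum to the number of fitted parameters (k+1), which is a quick sanity check on the computation.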

Multivariate Analysis

Pair Plot

Observations:

As observed earlier, the strength variable displays a positive relationship only with cement.

Quantile-Quantile Plot to check Distribution Similarity among Features

Except for cement and water, none of the other variables display distributions similar to other features. We can try to derive a new feature based on these two.

Correlation Matrix

coarseagg is the one variable that exhibits negative correlation with every other feature.

Highly correlated features with target variable strength

The strength of the concrete increases with the amount of cement (kg/m3). age (shown by data-point size) also appears to have a slightly positive correlation with strength.

Detecting Multicollinear features using Variance Inflation Factor (VIF)
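A minimal VIF sketch, using the identity that the VIFs (1 / (1 - R_i^2) for each predictor regressed on the others) are the diagonal of the inverse correlation matrix of the predictors. The data is synthetic, with one near-collinear column planted:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
X = np.column_stack([x1, x2, x3])

# VIF_i = 1 / (1 - R_i^2); equivalently the diagonal of the
# inverse correlation matrix of the predictors
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))
print({f"x{i+1}": round(v, 2) for i, v in enumerate(vif)})
```

A common rule of thumb flags VIF > 5 (or > 10) as problematic; here x1 and x3 blow past it while x2 stays near 1.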

Step 5: Split the data into Train, Validation and Test set
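A three-way split can be done with two calls to `train_test_split`; the 60/20/20 proportions below are an assumption for illustration, applied to synthetic data of the same shape as this dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(1030, 8))       # same shape as the concrete data
y = rng.normal(size=1030)

# 60/20/20: first carve out the test set, then split the remainder
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42)  # 0.25 * 0.80 = 0.20

print(len(X_train), len(X_val), len(X_test))
```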

Step 6: Feature Engineering & Data Preprocessing

Log Transformation for treating Skewness & Minimizing Outliers

Derived Feature

Domain-specific technical know-how:

From the correlation matrix, we can observe that correlation between

Derived Feature: water_cement_ratio:
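The derived feature is a simple element-wise ratio; a lower water/cement ratio generally means higher compressive strength. The two rows below use sample values from the dataset:

```python
import pandas as pd

df = pd.DataFrame({"cement": [540.0, 332.5], "water": [162.0, 228.0]})

# Lower water/cement ratio generally corresponds to higher strength
df["water_cement_ratio"] = df["water"] / df["cement"]
print(df["water_cement_ratio"].round(3).tolist())
```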

Binning - age Feature

Ordinal Encoding age_group (Transformed Feature)

Note: the age variable is binned here; however, it is not used in the final model, since binning results in a significant loss of information and degraded model performance.
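For reference, the binning and ordinal encoding described above can be sketched as follows; the bin edges and labels here are illustrative assumptions, not the ones used in the notebook:

```python
import pandas as pd

age = pd.Series([3, 7, 14, 28, 90, 180, 365], name="age")

# Hypothetical curing-period bins; the actual edges are a modelling choice
bins = [0, 7, 28, 90, 365]
labels = ["early", "standard", "mature", "old"]
age_group = pd.cut(age, bins=bins, labels=labels)

# Ordinal encoding: the bin order is meaningful, so category codes suffice
codes = age_group.cat.codes
print(list(zip(age.tolist(), age_group.tolist(), codes.tolist())))
```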

Robust Scaling the Features
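Robust scaling centres by the median and scales by the IQR, which matches the default behaviour of scikit-learn's `RobustScaler`; a manual sketch showing why it is insensitive to an extreme outlier:

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(50, 5, 200), [500.0]])  # one extreme outlier

# Robust scaling: (x - median) / IQR -- statistics that ignore the outlier
q1, q3 = np.percentile(x, [25, 75])
scaled = (x - np.median(x)) / (q3 - q1)

print(f"median of scaled: {np.median(scaled):.3f}")   # 0 by construction
print(f"IQR of scaled:    "
      f"{np.percentile(scaled, 75) - np.percentile(scaled, 25):.3f}")  # 1
```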

Unsupervised Learning Methods for Featurization

Cluster Map to understand Feature Dependence

K-Means Clustering

Finding the Optimal Number of Clusters

Observations:

Silhouette Score

Observations:
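A sketch of choosing k by silhouette score, run on synthetic well-separated blobs rather than the concrete features; higher silhouette means tighter, better-separated clusters:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs stand in for the real features
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [0, 10]],
                  cluster_std=0.8, random_state=7)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()}, "best k:", best_k)
```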

Observations on the different Clusters:

Let us plot the clusters using two of the major features, age and water_cement_ratio.

We can see that the clusters identified by KMeans are not well separated; several of the data points overlap, since KMeans considers only the cluster means.

Gaussian Mixture Models, by contrast, model both the mean and the variance (the full distribution) of the data points.

Gaussian Mixture Models

Observations:

Step 7: Building Baseline Model & Deciding on Model Complexity

Baseline models, both linear and tree-based, will be built to assess the required model complexity.

Building a Baseline Regressor - Linear Regression - Original Dataset
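A minimal baseline-regression sketch, fitted on synthetic data with a known linear signal (the real notebook fits the concrete features instead):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 3))
# Synthetic target: known linear signal plus noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

rmse = mean_squared_error(y_te, pred) ** 0.5
print(f"R2={r2_score(y_te, pred):.3f}, RMSE={rmse:.3f}")
```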

Building a Baseline Regressor - Linear Regression - Feature Engineered Dataset

Building a Polynomial Regressor

Building an SVR Regressor with a Radial Basis Function (RBF) Kernel

Building a Tree Regressor

Predictions vs Ground Truth Plot for Baseline Models

Observations:

Deciding on Model Complexity - Linear or Non-Linear

Plotting the Residuals to understand Linear Model Assumptions

Plotting Higher-Order Polynomial Fits to understand Linear Model Assumptions

Step 8: Feature Importance & Feature Selection

Let us use RandomForestRegressor, since the tree-based baseline model has been the best of the lot.

Feature Selection by Feature Importances

Feature Selection by Permutation Importance
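A sketch of permutation importance on synthetic data where only the first feature carries signal: the score drop when a column is shuffled measures how much the model relies on it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(9)
X = rng.normal(size=(400, 4))
y = 5.0 * X[:, 0] + rng.normal(scale=0.3, size=400)  # only feature 0 matters

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Permutation importance: drop in R^2 when one column is shuffled
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("importances:", result.importances_mean.round(3),
      "top feature:", ranking[0])
```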

Model-based Feature Selection

SelectFromModel (Scikit-Learn)

SequentialFeatureSelector (MLXtend)

Consensus feature selection from all the methods

Step 9: Model Building & Hyperparameter Tuning on Train & Validation set

RandomForestRegressor

RF Regressor with Hyperparameter Tuning
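A small illustrative GridSearchCV sketch for the random forest; the grid values and synthetic data are placeholders, not the tuned settings used in the notebook:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(10)
X = rng.normal(size=(200, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.2, size=200)

# Small illustrative grid; a real search would cover more values
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid, cv=3, scoring="neg_root_mean_squared_error")
search.fit(X, y)

print("best params:", search.best_params_,
      "best CV RMSE:", round(-search.best_score_, 3))
```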

XGB Regressor

XGB Regressor with Hyperparameter Tuning

LightGBM Regressor

LightGBM Regressor with Hyperparameter Tuning

Step 10: Model Selection & Evaluation on Test set

Model Selection using Cross Validation

Plotting the Learning Curve

Source: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html

Observations:

Plotting the Cross Validation scores of RF, XGB and LGBM

Hypothesis Testing using paired_ttest_5x2cv (MLXtend) on the Validation dataset

Model Evaluation on Test set

In line with its cross-validation scores, the XGB Regressor is also the better-performing regressor on the unseen test dataset, and it is recommended as the best among the tested algorithms.


Building Pipeline for finding the Best Model

Let us validate our finding that XGB is the better-performing model using Scikit-Learn Pipelines & GridSearchCV.
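The pattern can be sketched as follows: swap whole estimators through one step of a single pipeline inside GridSearchCV. Here Ridge stands in for the boosted models to keep the example dependency-free; the real pipeline would list RandomForestRegressor, XGBRegressor and LGBMRegressor in the grid. The data is synthetic with a linear signal, so the linear model is expected to win.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(11)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] - X[:, 2] + rng.normal(scale=0.3, size=300)

# One pipeline; candidate estimators are swapped through the "model" step
pipe = Pipeline([("scale", RobustScaler()), ("model", Ridge())])
param_grid = [
    {"model": [Ridge()], "model__alpha": [0.1, 1.0]},
    {"model": [RandomForestRegressor(random_state=0)],
     "model__n_estimators": [50]},
]
search = GridSearchCV(pipe, param_grid, cv=3, scoring="r2").fit(X, y)
best_name = type(search.best_estimator_.named_steps["model"]).__name__
print("best model:", best_name, "CV R2:", round(search.best_score_, 3))
```

Listing estimator instances under the step name in the parameter grid is what lets one search compare entire model families, not just hyperparameters.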

Fitting Pipeline on Train dataset

As expected, XGB Regressor is the best model among RF, XGB and LGBM for the given dataset.

Cross Validation Score using Validation set

Model Evaluation on Test Dataset

Step 11: Learnings and Summary

Statistical Summary and Initial EDA:

Univariate Analysis:

Multivariate Analysis:

Feature Engineering:

Outlier Treatment:

Unsupervised Learning Methods for EDA & Featurization:

Feature Importance & Feature Selection:

Baseline Model Building & Deciding on Model Complexity:

Model Building - with Non-Parametric Models:

Model Evaluation & Selection:

Model Pipeline & GridSearch: